+ - 0:00:00
Notes for current slide
Notes for next slide

Advanced R for Econometricians

Advanced R Concepts

Martin C. Arnold, Jens Klenke

1 / 43

Overview

Make sure the lobstr package is attached!

install.packages('lobstr')
library(lobstr)

Outline

  • Names and values

    • Bindings and references

    • Copy-on-modify / in-place modification

    • Unbinding and Garbage Collection

  • Functions

    • Fundamentals, scoping and Lazy Evaluation

    • Function forms

2 / 43

Names and Values

3 / 43

Assignment Operators

  • <- is often used for assignment but some people use = instead.

  • There is, however, a subtle difference in how they are evaluated when mixed in the same expression.

Example: operator precedence

a <- b <- 1
a == b
## [1] TRUE
a = b = 1
a == b
## [1] TRUE
a = b <- 1
a == b
## [1] TRUE
a <- b = 1
## Error in a <- b = 1: could not find function "<-<-"
4 / 43
  • This really is just a convention and nothing precludes using = instead of <- for assignment

  • When mixed, <- has precedence over =

  • Fact: <- comes from a time where there actually was a <- key on keyboards. <- and -> essentially do the same thing.

  • R interprets a <- b = 1 as '<-<-'(a, b = 1, value = 1)

Assignment Operators

  • For consistency we use <- for binding and = for assigning objects to function arguments.

  • Note, however, that there are reasonable proposals for using other conventions.


Task:

Find out what -> does and think of an application where it might be useful.

Hint: Experiment to find out about the precedence relation between <-, -> and =.

5 / 43
  • -> is the right-assignment operator

  • The precedence relation is -> >> <- >> = so x <- 1 -> b is another working alternative to the last line on the previous slide.

Assignment Operators

Assignment operators work in the environment they are invoked in. The super assignment operator <<- assigns in the enclosing environment, provided the binding exists there.

Example: super assignment

var_GE <- 1
a <- function(x) {
b <- function(x) var_GE <<- x
b(x)
}
a(3)
var_GE
## [1] 3
6 / 43
  • Note that functions generate their own environments upon execution. The execution environment of a() is the parent environment to b().

  • If <<- does not find a corresponding binding in the enclosing environment, it looks in the parent environment and works it's way up towards to global environment (GE)

  • If the binding does not exist in the GE, it will be created there

Bindings

  • Knowing what assignment does internally is crucial for understanding performance and memory usage of your code and R's functional programming tools.

  • What happens if we define a vector x? The idiom 'the object x stores the vector' is not quite right...

Example: binding a vector

Binding means that the name has a value: x is a reference to a value living in the computer's memory.


Source: Wickham (2019)

x <- c(1, 2, 3)


y <- x
7 / 43

Bindings — Character Vectors

A character vector is a binding to a vector of strings.

Example: binding a character vector


Source: Wickham (2019)

x <- c("a", "a", "abc", "d")
ref(x, character = TRUE)
## █ [1:0x10900f248] <chr>
## ├─[2:0x12b2ced78] <string: "a">
## ├─[2:0x12b2ced78]
## ├─[3:0x10495fc88] <string: "abc">
## └─[4:0x12b505670] <string: "d">
8 / 43

Copy-on-modify

R's copy-on-modify behavior is both blessing and curse:

  • We may use references without the risk of breaking existing code (convenient).

  • modifying a reference may trigger a copy of the value (unfavourable).

Example: copy-on-modify


Source: Wickham (2019)

x <- c(1, 2, 3)
y <- x
y[[3]] <- 4
x
## [1] 1 2 3
obj_addr(x)
## [1] "0x10900fbf8"
obj_addr(y)
## [1] "0x109a51258"
9 / 43
  • This is very different for many other languages, including c++ which we will see later during the course.

  • Question: object addresses (like 0x7f9ef3059d38) will be different (unpredictable) if the code is re-run. Why?

Copy-on-modify

We may use tracemem() to obtain info when a copy of an object is generated.

Example: keeping track of copies using tracemem()

tracemem() returns the copied object, the new address (and the call stack, if functions are involved).

x <- c(1, 2, 3)
tracemem(x)

[1] "<0x7fba7740ebb8>"

y <- x
y[[3]] <- 4

tracemem[0x7fba7740ebb8 -> 0x7fba76883958]:

y[[3]] <- 5
# stop tracking
untracemem(x)
10 / 43

Simply put, a call stack describes the order of nested function calls.

Copy-on-modify — Function Calls

The above rules apply to function calls as well.

Example: keeping track of copies using tracemem() — ctd.



Source: Wickham (2019)

f <- function(a) a
x <- c(1, 2, 3)
tracemem(x)
## [1] "<0x74b050b21318>"
z <- f(x) # no copy here!
untracemem(x)
11 / 43

No copy because z is just a reference to the value of x.

Copy-on-modify — Function Calls



Task:

Predict what tracemem() returns if the highlighted lines get executed, respectively.

f <- function(a) {
a[[1]] <- 0
a
}
x <- c(1, 2, 3)
tracemem(x)
f(x)
z <- f(x)
untracemem(x)
12 / 43
  • f(x): f() binds a to the same memory location x points to during execution. In Contrast to the previous slide, a copy is made since f() modifies a.

  • z<-f(x): as above. z is a binding the same location as a (no additional copy made).

  • Q: What would happen if we'd super assign to x inside of f()? A: no additional reference to x, so calling f() would not trigger a copy.

Copy-on-modify — Lists

Lists are special: list elements are references to values.

Task:

Which (sequentially called) statement generate the results shown in each diagram? What is special about the bottom one?

Source: Wickham (2019)

13 / 43
l1 <- list(1, 2, 3) # left
l2 <- l1 # right
l2[[3]] <- 4 # bottom

Note that copy-on-modify due to l2[[3]] <- 4 results in a shallow copy: the bindings are copied, not the values. ⇒ performance considerations!

Copy-on-modify — Lists


You may check your predictions using lobstr::ref().

ref(l1, l2)
## █ [1:0x13b0c5768] <list>
## ├─[2:0x106033f20] <dbl>
## ├─[3:0x106033ee8] <dbl>
## └─[4:0x106033eb0] <dbl>
##
## █ [5:0x13b3404a8] <list>
## ├─[2:0x106033f20]
## ├─[3:0x106033ee8]
## └─[6:0x106033dd0] <dbl>
14 / 43

Copy-on-modify — Data frames

A data frame is essentially a list whose elements point to vectors.

Example: data.frame()

Source: Wickham (2019)

df <- data.frame(
x = c(1, 5, 6),
y = c(2, 4, 3)
)

Task:

Explain why modifying rows is generally more costly than modifying columns for data frames.

15 / 43

A: modifying a single row implicates modifying all columns (this cannot be tracked by tracemem() but can be seen using lobstr::ref())

df[1, ] <- c(42, 42)
ref(d1)
# vs.
df$x <- c(7, 7, 7)
ref(d1)

Exercises

  1. Why is tracemem(1:10) not useful?

  2. Explain why tracemem() shows two copies when you run this code.
    Hint: carefully look at the difference between this code and the code on Slide 11.

    x <- c(1L, 2L, 3L)
    tracemem(x)
    x[[3]] <- 4
  3. Explain the below results.

    obj_size(1:10)
    ## 680 B
    obj_size(1:1e6)
    ## 680 B
16 / 43
  1. 1:10 is no binding so there's no point in tracing a value.

  2. Here x is an integer vector (x has double type as defined on Slide 11). Replacing 3L with 4, i.e. integer with double, triggers coercion of x to double. This always results in a copy.

  3. : is special in the sense that it generates integer sequences using only the first and the last element. The number of elements thus doesn't affect the required memory.

Modify-in-place

R modifies in-place in two cases:

  1. The object has only one binding

  2. The object is an environment

Example: optimised modification

v <- c(1, 2, 3)
obj_addr(v)
## [1] "0x7fcce1ca03d8"
v[[2]] <- 4
# check that v still points to the same memory location
obj_addr(v)
## [1] "0x7fcce1ca03d8"
17 / 43
  • Q: Running this code in RStudio will trigger a copy. Why?

    A: An entry in the Environment tab is a binding, i.e., there are more than one (two) references to v which triggers a copy!

    You need to run the code in the GUI version of R for reproducing the results.

    (using the R-GUI for this purpose is generally a good practice!)

  • Note that tracemem() does not play well with knitr

Case Study: Copy-On-Modify Inferno

Whether or not R copies an object—and if so, how often—can be hard to predict.


Example: loop modification of data frame (please don't!)

You should never modify a data frame in a loop.

x <- data.frame(
matrix(runif(5 * 1e4),
ncol = 5)
)
medians <- vapply(X = x, FUN = median, FUN.VALUE = numeric(1))
tracemem(x)
for (i in seq_along(medians)) {
x[[i]] <- x[[i]] - medians[[i]]
}
18 / 43
  • Note that vapply()'s FUN.VALUE requires a template for the return value (which is numeric 1×1 here)

  • We will come back to consequences of this behavior in the Chapter Improving Performance and benchmark against alternatives that require less copies.

Case Study: Copy-On-Modify Inferno

[[<-.data.frame is revealed to be quite expensive.

tracemem[0x7fe1aed76628 -> 0x7fe1ad4dc428]:
tracemem[0x7fe1ad4dc428 -> 0x7fe1ad4dc578]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dc578 -> 0x7fe1ad4dc658]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dc658 -> 0x7fe1ad4dc7a8]:
tracemem[0x7fe1ad4dc7a8 -> 0x7fe1ad4dc8f8]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dc8f8 -> 0x7fe1ad4dcb98]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dcb98 -> 0x7fe1ad4dcdc8]:
tracemem[0x7fe1ad4dcdc8 -> 0x7fe1ad4dd068]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dd068 -> 0x7fe1ad4dd308]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dd308 -> 0x7fe1ad4dd4c8]:
tracemem[0x7fe1ad4dd4c8 -> 0x7fe1ad4dd618]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4dd618 -> 0x7fe1ad4ddae8]: [[<-.data.frame [[<-
tracemem[0x7fe1ad4ddae8 -> 0x7fe1acdbda28]:
tracemem[0x7fe1acdbda28 -> 0x7fe1acdbe198]: [[<-.data.frame [[<-
tracemem[0x7fe1acdbe198 -> 0x7fe1acdbe6d8]: [[<-.data.frame [[<-
19 / 43

Using the $ operator makes no difference: $<- also has a $<-.data.frame method. The output of tracemem() then looks similar to this:

tracemem[0x7fe515db8d88 -> 0x7fe513bcd788]:
tracemem[0x7fe513bcd788 -> 0x7fe513b79a88]: $<-.data.frame $<-
tracemem[0x7fe513b79a88 -> 0x7fe513b79c88]: $<-.data.frame $<-

Case Study: Copy-On-Modify Inferno

What's happening here and why?

  • x is referenced more than once:

    • global environment
    • inside of [[
    • inside of [[<-.data.frame()

    → modification will result in a copy

  • The following runs inside [[:

    `*tmp*` <- df
    df <- `[[<-.data.frame`(`*tmp*`, <additional arguments>)
    rm(`*tmp*`)
    • The additional binding to *tmp* results in a copy

    • [[<-.data.frame changes the class and a component of x (two additional copies)

20 / 43
  • The two copies made by [[<-.data.frame are shallow copies (only column references are copied)

  • Q to students: What kind of function is [[<-.data.frame?

    A: A regular function. It is a method of [[<- which is a primitive (a fast C function)

  • Q to students: How can you view the source?

    A: `[[<-.data.frame`

  • More on primitives on the next slides. More on methods, dispatch etc. in the 'OOP' Chapter.

  • More on how to write efficient code (and especially efficient for() loops) in Chapter 'Improving Performance'

Case Study: Copy-On-Modify Inferno

It's better to use a list for this purpose.

y <- as.list(x)
tracemem(y)
for (i in 1:5) {
y[[i]] <- y[[i]] - medians[[i]]
}
tracemem[0x7fba72971928 -> 0x7fba72a2e178]:

Before loop:

█ [1:0x7fba72971928] <named list>
├─X1 = [2:0x10f1ec000] <dbl>
├─X2 = [3:0x115c56000] <dbl>
├─X3 = [4:0x115c6a000] <dbl>
├─X4 = [5:0x115c7e000] <dbl>
└─X5 = [6:0x115c92000] <dbl>

After loop:

█ [1:0x7fba72a2e178] <named list>
├─X1 = [2:0x115cc7000] <dbl>
├─X2 = [3:0x115ca6000] <dbl>
├─X3 = [4:0x10f07e000] <dbl>
├─X4 = [5:0x10ef83000] <dbl>
└─X5 = [6:0x10f056000] <dbl>
21 / 43
  • A single copy is made from internal C code the first time we use [[<-

  • This a good example where tweaking the code reduces the amount of copies made

  • If such a solution is not readily at hand we may resort to C++ code. More on this in the Rcpp chapter.

Modifying Lists

What happens if we modify list entries is better understood from the following example.


Example: modifying a list

Can you explain what's going on?

# step 1
x <- list(1:10)
lobstr::ref(x)
# step 2
x[[2]] <- x
lobstr::ref(x)
## █ [1:0x109555ad0] <list>
## └─[2:0x109b1d780] <int>
## █ [1:0x13b38a248] <list>
## ├─[2:0x109b1d780] <int>
## └─█ [3:0x109555ad0] <list>
## └─[2:0x109b1d780]
22 / 43
  • x is assigned to itself (an additional reference) so a copy on modification is made

  • The the old memory location of x has no binding anymore but is referenced within x

  • Note that lists are always copied on modification. The copy is, however, shallow .

Garbage Collection

Ubiquitous operations that are reflected in the RStudio's Environment tab are 'unbind' and 'delete'. What does actually happen if we alter a name or even 'delete' the object from the (global) environment?

Example: unbinding an object

(a) binding

x <- 1:3

(b) implicit unbinding

x <- 2:4

(c) explicit unbinding

rm(x)
23 / 43

(b) stresses why it's wrong to think of x to 'store' anything different than an address.

Garbage Collection — Quick Facts

  • R uses a tracing Garbage Collector (GC): it keeps track of objects in the global environment and references therein.

  • The GC runs automatically if space is needed for creating new objects. There is no need to actively force garbage collection. You can, however, do so by calling gc() with the side effect of obtaining info on memory occupation (there's also a button in RStudio for this).

  • You may run gcinfo(TRUE) if you wish to be informed when the GC runs

Example: garbage collection

gc() # just for the side effect
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 1730109 92.4 3013450 161.0 NA 3013450 161.0
## Vcells 4099683 31.3 10146329 77.5 32768 10146329 77.5
mem_used() # only total memory usage, but more exact
## 129,680,408 B
24 / 43
  • The only reason to call gc() is if you need to free-up memory for your operating system — which will hardly ever happen

  • vcells = memory used by vectors; ncells = memory used by anything else

  • The 'large numbers' report cells used (8 byte each)

  • lobstr::mem_used() does not agree with what's reported by your OS: there are other objects (generated by, e.g., the R interpreter) which are not captured

Functions

25 / 43

Functions

To understand computations in R, two slogans are helpful:

Everything that exists is an object. Everything that happens is a function call. John Chambers
26 / 43

Regular Functions vs. Primitives

  • Regular functions (closure type) live in environments and consist of a body along with formals

  • Primitives are special base R functions that call C code

Example: regular functions vs. primitives

typeof(lm)
## [1] "closure"
environment(lm)
## <environment: namespace:stats>
names(formals(lm))[1:4]
## [1] "formula" "data" "subset" "weights"
# body(lm) # too large, feel free to check!
typeof(sum)
## [1] "builtin"
environment(sum)
## NULL
names(formals(sum))
## NULL
body(sum)
## NULL
27 / 43

First-Class Functions

R functions are objects! This means we may do stuff that may seem exotic when compared to languages like C and Python that demand very explicit definitions and are much more restrictive.

Example: fun with anonymous functions

funs <- list(
function(x) x^2,
function(x) x^3
)
lobstr::ref(funs)
## █ [1:0x109a1bf88] <list>
## ├─[2:0x1058bc0b8] <fn>
## └─[3:0x1058bbfd8] <fn>
sapply(funs, function(z) z(5))
## [1] 25 125
(function(x) x^2)(5)
## [1] 25
(function(x) x^3)(5)
## [1] 125
28 / 43

Obviously, R functions are objects on their own right — they need not be bound to a name!

First-Class Functions



Task:

Explore what kind of function ( is, what it does, and explain why the statements

(function(x) x^2)(5)

and

(x<-5)
## [1] 5

are meaningful to R.

29 / 43
  • ( is a primitive:

    `(`
    ## .Primitive("(")
  • The R-help hints that ( is semantically equivalent to function(x) x.

    E.g., (x<-5) is valid (and useful) R code!

Lexical Scoping

Scoping refers to the routine of finding the value associated with a name. R's scoping mechanism follows four concepts you should already be familiar with. We summarise them briefly here.

  1. Name Masking:

    names defined inside a function mask names defined outside of it.

  2. Functions before variables:

    functions and objects in different environments may share the same name. R ignores non-function objects in function calls.

  3. Execution Environments:

    functions generate ephemeral environments.

  4. Dynamic Look-up:

    R searches for values when the function is run (and not when it's created).

30 / 43

Lexical Scoping

Task:

Write a code snippet which is useful for demonstrating all of the above concepts.

31 / 43
rm(x)
z <- function(x) x^2
# 3. everything in f() happens in an ephemeral environment
f <- function(g) {
if(!exists("x")) {
x <- 1
} else {
x <- x + 1
}
z <- 2 # 1. name masking
z(x + y) # 2. functions before variables
}
y <- 20
f(x) # 4. dynamic lookup
## [1] 441

Lazy Evaluation — Promises

Lazy evaluation allows R functions to behave quite differently than functions in most other languages and it's important to understand what is special about that.

  • A promise consists of an expression along with an environment and a value which is computed and cached the first time the promise is accessed

  • We implicitly use promises in functions via lazy evaluation. Here we refer to unevaluated arguments as promises.

Example: outside evaluation

f1 <- function(x) { y <<- 5; x + 1 }
f1(x = y <- 6)
## [1] 7
y
## [1] 6
32 / 43
  • Explanation: f() evaluates its argument x when it's needed: at runtime, y<-5 is bound outside of f() in the GE. Once x+1 needs to be computed, y<-6 is evaluated (which overwrites y in GE).

  • Loading a data set using data() uses a promise. See, e.g., data(AirPassengers).

  • The style shown in the example used by many base R function but it's not recommended since its hard to understand what's going on.

Lazy Evaluation

Example: laziness

double <- function(x) {
message("Computing...")
x * 2
}
clone <- function(x) {
c(x, x)
}
clone(double(20))
## Computing...
## [1] 40 40
33 / 43

We see that double(20) is evaluated only once: there's only one message printed. This is when c() inside of clone() looks for x for the first time.

Lazy Evaluation

Example: lazy evaluation of (default) function arguments

f <- function(x) {
cat("f: 'x doesn't matter to me.'")
}
f(x = stop("I don't matter."))
## f: 'x doesn't matter to me.'
# (default arguments)
f <- function(x = 1, y = x * 2, z = a + b) {
a <- 10
b <- 100
c(x, y, z)
}
f()
## [1] 1 2 110
34 / 43
  • Although we do not recommend this style, it reflects great flexibility due to lazy evaluation

  • Note that this approach may be useful when a default argument is computationally expensive to evaluate and not needed in every call to f() (this usually happens when the argument is a more complex function call than '+'())

Lazy Evaluation

Task:

ls() lists objects in the environment where it is called. Explain the results below.

f <- function(x = ls()) {
a <- 1
x
}
f()
## [1] "a" "x"
f(ls())
## [1] "desktop" "f"
35 / 43
  • Due to lazy evaluation, the evaluation environment for default arguments is the function environment

  • User supplied arguments are evaluated in the parent environment (GE here)

Lazy Evaluation

Lazy evaluation also applies to other situations, e.g., in control flow.

x <- NULL
if (!is.null(x) && x > 0) {
# <do something>
}

Task:

  1. Which part of the code seems problematic at first sight?

  2. Give an explanation for why the statement above does not result in an error.

36 / 43
  • NULL represents the null object in R. NULL is used mainly to represent a list with zero length, and is often returned by expressions and functions whose value is undefined.

  • Without lazy evaluation this statement would throw an error because x > 0 evaluates to a logical value of length zero (you cannot compare NULL to double)

  • Control flow stops after evaluating the first part of the condition in if(): the second statement would be evaluated only if the first is TRUE (here it is FALSE)

Exercises

  1. Explain why the following code does not work. Can you come up with a work-around without altering f()?

    f <- function(x, z) {
    z + x^2
    }
    f(x = z^2, z = 2)
    ## Error in f(x = z^2, z = 2): object 'z' not found
  2. Explain what the ... argument (ellipsis) does by means of the following example:

    f <- function(...) {
    names(list(...))
    }
    f(a = 1, b = 2)
    ## [1] "a" "b"
37 / 43
  1. Lazy evaluation happens at function definition, not invocation: R looks for z in the global environment because x is not z^2 by default, like in the fixed version below.

    f <- function(x = z^2, z) {
    z + x^2
    }
    f(2, 2)

    Workaround using with:

    with(list(z = 2), f(x = z^2, z))

  2. ... enables us to pass arguments the body of f() (to other functions!) which do not have to be specified at definition of f().

    Here, f() is a simple wrapper that returns names of list elements as passed to the ... argument.

Function Forms

  • Not all function calls look the same:

    • prefix: f(a, b)

    • infix: a + b. Also ==, <-, ::, ...

    • replacement: names(c) <- c("x", "y")

    • special: for, [[, if, ...

  • Every function call can be written in prefix form!

    `[`(1:3, 3)
    ## [1] 3
38 / 43
  • @infix forms: user defined operators (which always begin and end with %) belong to this class.

  • NB: is doesn't matter whether one uses `` or '' in prefix form

Function Forms

Example: rewrite special and infix as prefix

# addition
1 + 2
## [1] 3
`+`(1, 2)
## [1] 3
# integer sequence generation
x <- `:`(1, 10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
# evaluation
`(`(x)
## [1] 1 2 3 4 5 6 7 8 9 10
39 / 43

Function Forms — User defined function in infix form

It's straightforward to write your own operators in infix form.

Example: paste0() as infix

`%+%` <- function(a, b) paste0(a, b)
"new " %+% "string"
## [1] "new string"
40 / 43

Function Forms — Replacement Functions

Replacement functions have the form shown below.

`some_name<-` <- function(x, value) {
<do something (i.e, modify x)>
return(x)
}

Example: replacement of the last vector element

`last<-` <- function(x, value) {
x[length(x)] <- value
x
}
x <- c(1, 2, 3)
last(x) <- 99
x
## [1] 1 2 99
41 / 43
  • Note that replacement functions must have x and value as arguments and return the modified object x

  • Additional arguments may be passed between x and value

Function Forms — Replacement Functions

Replacement functions are very convenient but there is no free lunch:

Replacements always trigger copies!

Example: replacement of last vector element — ctd.

tracemem(x)
## <0x7feac1eb7598>
last(x) <- 420
## tracemem[0x7feac1eb7598 -> 0x7feaa7ce3908]:
## tracemem[0x7feaa7ce3908 -> 0x7feaa7ce68c8]: last<-
x
## [1] 1 2 420
42 / 43

tracemem() reports two copies: the first occurs because last<- creates a copy inside its own environment before modification and R runs

x <- `last<-`(x, 420)

under the hood.

Thank You!

43 / 43

Overview

Make sure the lobstr package is attached!

install.packages('lobstr')
library(lobstr)

Outline

  • Names and values

    • Bindings and references

    • Copy-on-modify / in-place modification

    • Unbinding and Garbage Collection

  • Functions

    • Fundamentals, scoping and Lazy Evaluation

    • Function forms

2 / 43
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow